This notebook uses a number of relatively simple quantitative metrics to operationalize the integrative complexity of numeral (i.e., number word) systems across languages. This complexity can be taken as a measure of morphological irregularity: it distinguishes relatively transparent numerals (e.g., English twenty-one, which bears a clear relationship to the numerals twenty and one) from less transparent ones (e.g., Hindi ɪkkis ‘21’, whose relationship to the numerals ek ‘1’ and bis ‘20’ is less clear). These metrics show the following generalizations:
Data for cross-linguistic numeral systems were taken from two sources:
We used only data for the numerals 1–99 from both databases.
In line with the information-theoretic principle of minimum description length (MDL; Rissanen 1983), we seek the shortest set of combinable elements needed to generate all 99 numeral words of interest. To this end, we use a simplified version of models employed for morpheme and word segmentation (Goldsmith 2001; Creutz and Lagus 2007; Goldwater, Griffiths, and Johnson 2009), applying expectation maximization (EM; Dempster, Laird, and Rubin 1977) to segment each numeral form in each language into recurrent subword units such that the set of segmented units is minimized.
For each language, we randomly initialize the segmentation of each numeral word form \(w\). We do not allow numerals 1–9 to be segmented (as they tend to be simplex forms, except in systems with bases lower than 10, which are underrepresented in the data sample); we allow 11–19 and multiples of 10 either to be unsegmented (as in less transparent, more fusional forms like En. twelve, Sp. cuarenta ‘40’) or to have a single segmentation index \(i \in \{2,\dots,|w|-1\}\), where \(|w|\) represents word form length and \(i\) the index marking the start of the second subword unit (as in more transparent forms like Gm. acht-zehn ‘18’, Jp. ni-jū ‘20’). All other numerals are forced to have either a single segmentation index (as above) or two segmentation indices \(i, j \in \{2,\dots,|w|-1\}\), \(i \neq j\). Segmented units are placed in a cache \(\mathbb{W}\).
For each EM iteration \(t\), we visit each word \(w\) representing the numerals 10–99 in random order, removing the currently segmented units \(\sigma(w)_{z^{(t-1)}}\) from the cache \(\mathbb{W}\). We then choose a new segmentation \(z^{(t)} = \operatorname{arg\,min}_{z} \big|\mathbb{W} \cup \sigma(w)_{z}\big|\) (either no segmentation, one index, or two, depending on the conditions outlined above), i.e., the segmentation that minimizes the description length of the cache. We stop when the description length reaches a minimum, or after 1000 iterations. Because this version of the EM algorithm converges on local rather than global optima, we run the procedure 10 times per language and keep the shortest description length across runs. We note that this procedure does not account for allophonic processes, and may infer distinct subword elements that are underlyingly identical under standard phonological analyses, due to allophonic or orthographic variation.
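The iterative re-segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it segments every word (rather than restricting 1–9 and 11–19 as described above), measures description length as the total character count of the distinct cached units, and the function names are assumptions.

```python
import random
from collections import Counter

def candidate_segs(w, max_cuts=2):
    """All allowed segmentations of w: unsegmented, one cut, or two cuts
    (each resulting unit at least one character long)."""
    n = len(w)
    segs = [(w,)]
    segs += [(w[:i], w[i:]) for i in range(1, n)]
    if max_cuts >= 2:
        segs += [(w[:i], w[i:j], w[j:])
                 for i in range(1, n - 1) for j in range(i + 1, n)]
    return segs

def description_length(units):
    # Total characters across the *distinct* units (a simple DL proxy).
    return sum(len(u) for u in units)

def segment_system(words, iters=1000, seed=0):
    rng = random.Random(seed)
    # Random initialization of each word's segmentation.
    seg = {w: rng.choice(candidate_segs(w)) for w in words}
    cache = Counter(u for s in seg.values() for u in s)
    for _ in range(iters):
        changed = False
        order = list(words)
        rng.shuffle(order)
        for w in order:
            # Remove w's current units from the cache.
            for u in seg[w]:
                cache[u] -= 1
                if cache[u] == 0:
                    del cache[u]
            # Re-segment w so as to minimize the cache's description length.
            new = min(candidate_segs(w),
                      key=lambda s: description_length(set(cache) | set(s)))
            if new != seg[w]:
                changed = True
            seg[w] = new
            cache.update(new)
        if not changed:
            break
    return seg, description_length(set(cache))
```

Because the greedy updates converge only to local optima, restarting with different seeds and keeping the shortest description length (as described above) is essential in practice.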
This procedure yields a single MDL value for each language, representing the complexity of the language’s numeral system as a whole.
We additionally operationalize the complexity of numeral systems using n-gram (specifically segmental/grapheme trigram) continuation surprisal, representing the unpredictability of a phoneme or grapheme in context, i.e., given the two preceding phonemes/graphemes (Piantadosi, Tily, and Gibson 2012; Dautriche et al. 2017), averaged within and across words. Numeral systems containing more recurrent, predictable elements are expected to exhibit lower surprisal. We compute the mean n-gram surprisal of each numeral word conditioned on all other words in the system, i.e., each word in turn is held out and the model is trained on the remainder.
For each language, we compute the held-out trigram surprisal of each word form as follows. First, we prepend two instances of the beginning-of-string symbol to the word form.
This procedure yields a surprisal value for each numeral in the system, conditioned on the other members of the system. These values can be averaged to represent the overall surprisal of the system at the language level.
We use a simplified version of the linear mapping approach (proposed in Baayen, Chuang, and Blevins 2018 et seq.) to model the production of numeral forms (e.g., twelve) given an underlying semantic representation ({TENS=1, DIGITS=2}). Unlike the original version of this model, which uses continuous semantic vectors, numeral forms' meanings are represented by two one-hot vectors (encoding a numeral's tens- and digits-place values), which are concatenated to form the rows of the meaning matrix \(\boldsymbol{S}\). Numeral forms are represented by vectors of the counts of the trigrams they contain, which form the rows of the word form matrix \(\boldsymbol{W}\).
Form generation given a semantic representation \(\boldsymbol{s}_i\) proceeds as follows. We compute the least-squares solutions \(\boldsymbol{\hat{\beta}_{sw}}\) and \(\boldsymbol{\hat{\beta}_{ws}}\) that solve the equations \(\boldsymbol{S}_{-i}\boldsymbol{\hat{\beta}_{sw}} = \boldsymbol{W}_{-i}\) and \(\boldsymbol{W}_{-i}\boldsymbol{\hat{\beta}_{ws}} = \boldsymbol{S}_{-i}\), respectively.
We then compute a vector of weights \(\boldsymbol{s}_i^{\top} \boldsymbol{\hat{\beta}_{sw}}\), representing the association strengths of different trigrams (in reality their predicted counts, in real-valued space) with the semantic representation \(\boldsymbol{s}_i\). We decode the form via beam search: starting with trigrams beginning with the start-of-sequence token, we consider, at each step, the two trigrams that form valid continuations of the sequence with the highest association strengths, stopping when the end-of-sequence token is reached. For each candidate form \(\boldsymbol{\hat{w}}\), we compute the predicted meaning \(\boldsymbol{\hat{w}}^{\top} \boldsymbol{\hat{\beta}_{ws}}\) and choose the candidate form that shows the highest correlation (Pearson's \(r\)) with \(\boldsymbol{s}_i\). We measure the error rate between the predicted and true forms as the Levenshtein distance between them divided by the length of the longer form.
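The core of this pipeline can be sketched as follows. For brevity, the sketch omits the beam-search decoder and instead scores a given set of candidate form vectors by Pearson correlation of their mapped meanings, and it adds the normalized Levenshtein error rate; all function names here are illustrative assumptions, not the authors' API.

```python
import numpy as np

def fit_mappings(S, W):
    """Least-squares linear mappings between meaning matrix S and form matrix W."""
    B_sw, *_ = np.linalg.lstsq(S, W, rcond=None)  # meaning -> form
    B_ws, *_ = np.linalg.lstsq(W, S, rcond=None)  # form -> meaning
    return B_sw, B_ws

def choose_form(s_i, candidates_W, B_ws):
    """Index of the candidate form vector whose mapped meaning correlates
    best (Pearson's r) with the target meaning s_i."""
    best, best_r = None, -np.inf
    for k, w in enumerate(candidates_W):
        r = np.corrcoef(w @ B_ws, s_i)[0, 1]
        if r > best_r:
            best, best_r = k, r
    return best

def per(pred, true):
    """Levenshtein distance between two strings, normalized by the
    length of the longer string."""
    m, n = len(pred), len(true)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (pred[i - 1] != true[j - 1]))
    return d[n] / max(m, n)
```

With one-hot meaning rows and full-rank form matrices, the least-squares mappings recover the held-out associations exactly; in realistic, rank-deficient settings they only approximate them, which is what the error rate measures.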
To model the comprehension or discrimination of the meaning of a given form, we train two multinomial logistic classifiers on the word form matrix \(\boldsymbol{W}_{-i}\), using them to predict the tens and digits labels of \(\boldsymbol{w}_i\) with maximum probability. We treat classification accuracy as a binary variable valued \(1\) if both the tens and digits labels are correctly predicted and \(0\) otherwise.
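One of the two classifiers (tens or digits) can be sketched as plain softmax regression trained by gradient descent; this is a stand-in for whatever off-the-shelf multinomial logistic implementation is actually used, with assumed hyperparameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, y, n_classes, lr=0.5, steps=500):
    """Multinomial logistic regression via batch gradient descent.
    X: form vectors (rows = words), y: integer class labels."""
    B = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(steps):
        P = softmax(X @ B)
        B -= lr * X.T @ (P - Y) / len(y)
    return B

def predict(B, x):
    """Class with maximum probability for a single form vector x."""
    return int(np.argmax(x @ B))
```

A held-out word then counts as correctly comprehended only if both the tens classifier and the digits classifier return its true label.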
Here, visualizations for different complexity metrics based on the UniNum dataset are given. Chinese varieties are excluded from the surprisal map due to inflated surprisal values resulting from their ideographic writing system.
Below, the first principal component resulting from a PCA performed on these variables is displayed on a map:
PC1 explains 99.9% of the variance in the data, and shows a strong correlation with MDL.
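PC1 scores and variance-explained proportions of this kind can be computed as follows (a sketch assuming PCA via SVD on centered but unstandardized metric columns, which lets the large-magnitude MDL values dominate PC1, consistent with the strong PC1–MDL correlation; the function name is an assumption):

```python
import numpy as np

def pc1(X):
    """First principal component scores of metric matrix X (rows = languages,
    columns = complexity metrics), and the proportion of variance explained."""
    Xc = X - X.mean(axis=0)               # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                   # project onto first PC
    explained = s[0] ** 2 / (s ** 2).sum()
    return scores, explained
```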
Predicted complexity values for South Asia vs. rest of world:
Here, visualizations for different complexity metrics based on the augmented SAND dataset are given.
Below, the first principal component resulting from PCA performed on these variables is displayed on a map:
PC1 explains virtually all of the variance in the data (proportion ≈ 1), and shows a strong correlation with MDL.
Below, the first principal component is shown for all Indo-Aryan languages, with language labels (n.b. that some choices of coordinates, e.g., for Sanskrit, may not provide a good indication of the region in which languages are/were spoken):
Predicted complexity values for Indo-Aryan vs. non-Indo-Aryan:
Here, we list complexity values for individual Indo-Aryan languages, ordered according to PC1:
| Name | PC1 | MDL | surprisal | PER | accuracy |
|---|---|---|---|---|---|
| Hajong | -21.7113033 | 30 | 0.9646738 | 0.2748347 | 0.8888889 |
| Kudmali | -13.7158920 | 38 | 0.9265155 | 0.1034741 | 0.8989899 |
| Indus Kohistani | -12.7137022 | 39 | 0.9252964 | 0.1777522 | 0.6464646 |
| Shina | -10.7142756 | 41 | 0.9157767 | 0.2882573 | 0.7171717 |
| Rohingya | -10.7097736 | 41 | 1.1734988 | 0.2519170 | 0.8383838 |
| Palula | -5.7208720 | 46 | 0.7473512 | 0.2228144 | 0.8282828 |
| Torwali | -4.7189559 | 47 | 0.8675831 | 0.1828304 | 0.8484848 |
| Sinhala | -3.7159348 | 48 | 1.0228114 | 0.2102460 | 0.8888889 |
| Sanskrit | -3.7058906 | 48 | 1.3595579 | 0.3908640 | 0.7272727 |
| Phalura | -1.7236683 | 50 | 0.7151789 | 0.1462739 | 0.8686869 |
| Lambadi | -1.7119757 | 50 | 1.2089042 | 0.1939510 | 0.8181818 |
| Domaaki | -0.7166503 | 51 | 0.7925570 | 0.2084482 | 0.2121212 |
| Kashmiri | -0.6998324 | 51 | 1.5990529 | 0.4662076 | 0.5858586 |
| Halbi | 0.2791241 | 52 | 0.8223804 | 0.1777302 | 0.7676768 |
| Brokskad | 2.2801786 | 54 | 0.8609005 | 0.2672853 | 0.7474747 |
| Kalasha | 5.2773222 | 57 | 0.8291410 | 0.1721655 | 0.8181818 |
| Saraiki | 9.2981383 | 61 | 1.6220775 | 0.5230163 | 0.5757576 |
| Gaddi | 9.3044439 | 61 | 1.9018381 | 0.5149403 | 0.5555556 |
| Pahari | 10.3034984 | 62 | 1.8754042 | 0.5291076 | 0.5757576 |
| Sindhi | 12.2987308 | 64 | 1.7280754 | 0.4603491 | 0.6262626 |
| Sylheti | 13.2981560 | 65 | 1.7159670 | 0.4379585 | 0.6060606 |
| Dogri | 13.3085068 | 65 | 2.0900982 | 0.5133833 | 0.4141414 |
| Dhivehi | 15.2871934 | 67 | 1.2829792 | 0.3996908 | 0.6666667 |
| Konkani | 15.3007209 | 67 | 1.8054765 | 0.4483577 | 0.4646465 |
| Nagpuri | 19.3004819 | 71 | 1.8868177 | 0.4430827 | 0.5656566 |
| Kulvi | 19.3062491 | 71 | 2.0972168 | 0.4965492 | 0.4747475 |
| Nepali | 20.2977425 | 72 | 1.7958055 | 0.4950057 | 0.6666667 |
| Pakistani Punjabi | 20.3025167 | 72 | 1.9261491 | 0.4812200 | 0.4141414 |
| Urdu | 21.3012741 | 73 | 1.8995732 | 0.4986164 | 0.4747475 |
| Garhwali | 22.2988057 | 74 | 1.8557361 | 0.4432907 | 0.5757576 |
| Indian Punjabi | 22.3021722 | 74 | 1.9742159 | 0.4576899 | 0.4949495 |
| Kumaoni | 23.2985458 | 75 | 1.8570291 | 0.4452969 | 0.5757576 |
| Oriya | 23.2997747 | 75 | 1.9242077 | 0.4530390 | 0.6161616 |
| Panchparganiya | 24.2970098 | 76 | 1.7924628 | 0.5057436 | 0.6060606 |
| Gujarati | 26.3018633 | 78 | 1.9887736 | 0.5379418 | 0.4949495 |
| Khortha | 27.2989620 | 79 | 1.9119964 | 0.4689459 | 0.5454545 |
| Bengali | 28.2975929 | 80 | 1.8402168 | 0.5197726 | 0.5252525 |
| Assamese | 29.2925140 | 81 | 1.6544109 | 0.4402032 | 0.5353535 |
| Wagdi | 29.3031549 | 81 | 2.0974679 | 0.5198141 | 0.5050505 |
| Marathi | 30.2974201 | 82 | 1.8531737 | 0.5289624 | 0.5151515 |
| Marwari | 32.3032315 | 84 | 2.1431204 | 0.5380146 | 0.5252525 |
| Rajbongshi | 33.2973761 | 85 | 1.8678471 | 0.5706885 | 0.4848485 |
| Bagri | 34.2946482 | 86 | 1.8224834 | 0.4731891 | 0.5757576 |
Relationships between complexity, elevation, and vigesimality are visualized below:
Predicted PC1 values for vigesimal vs. non-vigesimal systems from a fitted model (gam(PC1 ~ vigesimal + s(elevation), data=merged[merged$Family=='Indo-Aryan',])):
Predicted slope of smooth function representing effect of elevation on PC1 from same model:
(Figures: GAM smooths, marginal slopes, and marginal effects of group, shown for selected languages; these three panels are repeated for each model fit.)
Correlations between surprisal values computed from the UniNum dataset under different smoothing constants are shown below:
Correlations between surprisal values computed from the SAND dataset under different smoothing constants are shown below:
Baayen, R Harald, Yu-Ying Chuang, and James P Blevins. 2018. “Inflectional Morphology with Linear Mappings.” The Mental Lexicon 13 (2): 230–68.
Creutz, Mathias, and Krista Lagus. 2007. “Unsupervised Models for Morpheme Segmentation and Morphology Learning.” ACM Transactions on Speech and Language Processing (TSLP) 4 (1): 1–34.
Dautriche, Isabelle, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T Piantadosi. 2017. “Words Cluster Phonetically Beyond Phonotactic Regularities.” Cognition 163: 128–45.
Dempster, Arthur P, Nan M Laird, and Donald B Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22.
Goldsmith, John. 2001. “Unsupervised Learning of the Morphology of a Natural Language.” Computational Linguistics 27 (2): 153–98.
Goldwater, Sharon, Thomas L Griffiths, and Mark Johnson. 2009. “A Bayesian Framework for Word Segmentation: Exploring the Effects of Context.” Cognition 112 (1): 21–54.
Kumari, Mamta. 2023a. “South Asian Numerals Database (SAND).” Leipzig: Max Planck Institute for Evolutionary Anthropology.
———. 2023b. “CLDF Dataset Derived from Mamta’s "South Asian Numerals Database" from 2023.” Zenodo. https://doi.org/10.5281/zenodo.10033151.
Piantadosi, Steven T, Harry Tily, and Edward Gibson. 2012. “The Communicative Function of Ambiguity in Language.” Cognition 122 (3): 280–91.
Rissanen, Jorma. 1983. “A Universal Prior for Integers and Estimation by Minimum Description Length.” The Annals of Statistics 11 (2): 416–31.
Ritchie, Sandy, Richard Sproat, Kyle Gorman, Daan van Esch, Christian Schallhart, Nikos Bampounis, Benoît Brard, Jonas Fromseier Mortensen, Millie Holt, and Eoin Mahon. 2019. “Unified Verbalization for Speech Recognition & Synthesis Across Languages.” In INTERSPEECH, 3530–4.